In the Claims:

1. (Currently Amended) A method for detecting similar documents comprising the steps of:

obtaining a document; 

filtering the document to obtain a filtered document; 

determining a document identifier for the filtered document and a single hash value for 
the filtered document; 

generating a single tuple for the filtered document, the tuple comprising the document 
identifier for the filtered document and the hash value for the filtered document; 

comparing the tuple for the filtered document with a document storage structure 
comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a 
plurality of documents, each tuple in the plurality of tuples comprising a document identifier and 
a hash value; and 

determining if the tuple for the filtered document is clustered with another tuple in the 
document storage structure, thereby detecting if the document is similar to another document 
represented by the another tuple in the document storage structure. 
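
For illustration only, and not as part of the claims, the steps recited in claim 1 can be sketched roughly as follows. The filter (lowercasing and keeping alphanumeric words), the use of SHA-1, the dictionary used as the document storage structure, and the names `filter_document`, `make_tuple`, and `DocumentStore` are all assumptions made for this sketch; the claim itself does not fix any of them.

```python
import hashlib
import re
from collections import defaultdict


def filter_document(text):
    """Hypothetical filter: lowercase the text and keep only alphanumeric words."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def make_tuple(doc_id, text):
    """Build the single (document identifier, hash value) tuple for a document."""
    filtered = filter_document(text)
    digest = hashlib.sha1(filtered.encode("utf-8")).hexdigest()
    return (doc_id, digest)


class DocumentStore:
    """Stand-in for the document storage structure: hash value -> stored tuples."""

    def __init__(self):
        self._by_hash = defaultdict(list)

    def compare_and_insert(self, tup):
        """Return the tuples already clustered with `tup`, then store `tup`."""
        _, digest = tup
        clustered = list(self._by_hash[digest])
        self._by_hash[digest].append(tup)
        return clustered


store = DocumentStore()
first = store.compare_and_insert(make_tuple("doc-1", "Hello, World!"))
second = store.compare_and_insert(make_tuple("doc-2", "hello   world"))
```

In this sketch `first` is empty because the store starts empty, while `second` contains the tuple for `doc-1`, since both texts reduce to the same filtered document and therefore the same hash value.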

2. (Original) A method as in claim 1, wherein the step of filtering comprises parsing the 
document, and wherein the filtered document comprises a token stream, the token stream 
comprising a plurality of tokens. 

3. (Original) A method as in claim 2, wherein the step of filtering further comprises 
retaining a token in the token stream as a retained token according to at least one token threshold. 

4. (Original) A method as in claim 3, wherein the step of filtering further comprises 
arranging the retained tokens in the token stream to obtain an arranged token stream. 

5. (Original) A method as in claim 3, wherein the step of determining the hash value for
the filtered document comprises determining the hash value by processing individually each 
retained token in the token stream. 

6. (Original) A method as in claim 2, wherein the step of filtering further comprises: 
determining a score for each token in the token stream; 

comparing the score for each token to a first token threshold; and 

modifying the token stream by removing each token having a score not satisfying the first 
token threshold and retaining each token as a retained token having a score satisfying the first 
token threshold. 

7. (Original) A method as in claim 6, wherein the step of filtering further comprises: 
comparing the score for each retained token to a second token threshold; and 
modifying the token stream by removing each retained token having a score not satisfying
the second token threshold and retaining each retained token having a score satisfying the second
token threshold. 
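
For illustration only, the two-threshold filtering of claims 6 and 7 might look like the sketch below. It assumes that a score "satisfies" the first threshold when it is at least `low` and "satisfies" the second when it is at most `high`; the claims do not define what satisfying a threshold means, and the scores shown are invented for the example.

```python
def filter_by_thresholds(tokens, scores, low, high):
    """Keep tokens whose score lies in [low, high].

    `scores` is a hypothetical mapping from token to score; tokens with no
    score are treated as scoring 0.0 (an assumption of this sketch).
    """
    # First pass: remove tokens whose score does not satisfy the first threshold.
    retained = [t for t in tokens if scores.get(t, 0.0) >= low]
    # Second pass: remove retained tokens that do not satisfy the second threshold.
    retained = [t for t in retained if scores.get(t, 0.0) <= high]
    return retained


scores = {"the": 0.1, "hash": 2.3, "tuple": 3.1, "xqzzy": 9.7}
tokens = ["the", "hash", "tuple", "xqzzy"]
kept = filter_by_thresholds(tokens, scores, low=1.0, high=5.0)
```

Here the very common token `the` fails the first threshold and the very rare token `xqzzy` fails the second, leaving `hash` and `tuple` as retained tokens.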

8. (Original) A method as in claim 2, wherein the step of filtering further comprises 
removing from the token stream at least one token corresponding to a stop word. 

9. (Original) A method as in claim 2, wherein the step of filtering further comprises 
removing a token from the token stream if the token is a duplicate of another token in the token 
stream. 

10. (Previously Amended) A method as in claim 2, wherein the step of filtering further 
comprises removing a token from the token stream based on collection statistics and at least one 
token threshold.

11. (Original) A method as in claim 2, wherein the step of filtering comprises removing
at least one token from the token stream. 

12. (Original) A method as in claim 1, wherein the step of filtering comprises removing 
formatting from the document. 

13. (Original) A method as in claim 1, wherein the step of filtering uses collection 
statistics for filtering the document. 

14. (Original) A method as in claim 13, wherein the collection statistics pertain to the 
plurality of documents. 
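
For illustration only, one common collection statistic is inverse document frequency (IDF), computed over the plurality of documents. The claims do not name a particular statistic, so the choice of IDF, the formula `log(N / df)`, and the function name `idf_scores` are assumptions of this sketch.

```python
import math
from collections import Counter


def idf_scores(collection):
    """Compute log(N / document frequency) for every token in the collection.

    `collection` is a list of token lists, one list per document.
    """
    n = len(collection)
    df = Counter()
    for tokens in collection:
        df.update(set(tokens))  # count each token at most once per document
    return {token: math.log(n / count) for token, count in df.items()}


collection = [["hash", "tuple", "the"], ["hash", "the"], ["tree", "the"]]
scores = idf_scores(collection)
```

A token that appears in every document, such as `the` here, receives a score of zero, and rarer tokens receive higher scores, so scores of this kind could feed a token-threshold filter.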

15. (Original) A method as in claim 1, wherein the step of determining the hash value for 
the filtered document comprises using a hash algorithm to determine the hash value, the hash
algorithm having an approximately even distribution of hash values. 

16. (Original) A method as in claim 1, wherein the step of determining the hash value for 
the filtered document comprises using a standard hash algorithm to determine the hash value. 

17. (Original) A method as in claim 1, wherein the step of determining the hash value for 
the filtered document comprises using a secure hash algorithm to determine the hash value. 

18. (Original) A method as in claim 1, wherein the step of determining the hash value for 
the filtered document comprises using hash algorithm SHA-1 to determine the hash value. 
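
For illustration only, a single SHA-1 hash value can be produced by feeding the tokens to the hash one at a time, which also matches the token-by-token processing of claim 5. The sketch uses Python's standard `hashlib` module; the null-byte separator between tokens and the function name `hash_tokens` are assumptions added so that different token boundaries give different digests.

```python
import hashlib


def hash_tokens(tokens):
    """Return one SHA-1 hex digest for a whole token stream."""
    h = hashlib.sha1()
    for token in tokens:
        h.update(token.encode("utf-8"))
        h.update(b"\x00")  # separator between tokens (assumption of this sketch)
    return h.hexdigest()


digest = hash_tokens(["similar", "document", "detection"])
```

The result is a 40-character hexadecimal string, and the same token stream always yields the same value.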

19. (Original) A method as in claim 1, wherein the document storage structure comprises
a hash table. 

20. (Original) A method as in claim 1, wherein the document storage structure comprises
a tree.

21. (Original) A method as in claim 20, wherein the tree comprises a binary tree.

22. (Original) A method as in claim 21, wherein the binary tree comprises a binary
balanced tree. 

23. (Original) A method as in claim 1, wherein the document storage structure comprises 
a hash table and at least one tree. 

24. (Original) A method as in claim 1, wherein the step of comparing comprises inserting 
the tuple into the document storage structure. 

25. (Original) A method as in claim 1, wherein the document storage structure comprises 
a hash table, the hash table comprising a plurality of bins, each bin of the hash table comprising 
at least one tuple of the plurality of tuples, and 

wherein the step of determining if the tuple is clustered with another tuple comprises 
determining if the tuple is co-located with another tuple at a bin of the hash table. 

26. (Original) A method as in claim 1, wherein the document storage structure comprises 
a tree, the tree comprising a plurality of branches, each bucket of the tree comprising at least one 
tuple of the plurality of tuples, and 

wherein the step of determining if the tuple is clustered with another tuple comprises 
determining if the tuple is co-located with another tuple in a bucket of the tree.

27. (Original) A computer for performing the method of claim 1.

28. (Original) A computer-readable medium having software for performing the method 
of claim 1. 

29. (Currently Amended) An apparatus for detecting similar documents comprising: 
means for obtaining a document; 

means for filtering the document to obtain a filtered document; 

means for determining a document identifier for the filtered document and a single hash 
value for the filtered document; 

means for generating a single tuple for the filtered document, the tuple comprising the 
document identifier for the filtered document and the hash value for the filtered document; 

means for comparing the tuple for the filtered document with a document storage 
structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of 
a plurality of documents, each tuple in the plurality of tuples comprising a document identifier 
and a hash value; and 

means for determining if the tuple for the filtered document is clustered with another 
tuple in the document storage structure, thereby detecting if the document is similar to another 
document represented by the another tuple in the document storage structure. 

30. (Currently Amended) A method for detecting similar documents comprising the steps of:

obtaining a document; 

parsing the document to remove formatting and to obtain a token stream, the token stream 
comprising a plurality of tokens; 

retaining only retained tokens in the token stream by using at least one token threshold;

arranging the retained tokens to obtain an arranged token stream;

processing in turn each retained token in the arranged token stream using a hash
algorithm to obtain a single hash value for the document;

generating a document identifier for the document;

forming a single tuple for the document, the tuple comprising the document identifier for 
the document and the hash value for the document; 

inserting the tuple for the document into a document storage tree, the document storage 
tree comprising a plurality of tuples, each tuple located at a bucket of the document storage tree, 
each tuple in the plurality of tuples representing one of a plurality of documents, each tuple in the 
plurality of tuples comprising a document identifier and a hash value; and 

determining if the tuple for the document is co-located with another tuple at a same 
bucket in the document storage tree, thereby detecting if the document is similar to another 
document represented by the another tuple in the document storage tree. 
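
For illustration only, the steps of claim 30 can be sketched end to end as below. Several choices are assumptions rather than anything the claim requires: tokens are retained when a hypothetical score is at least `threshold`, duplicates are dropped, "arranging" is taken to mean sorting, the hash is SHA-1, and the document storage tree is a plain unbalanced binary search tree whose nodes act as buckets keyed by hash value.

```python
import hashlib
import re


def single_hash(text, scores, threshold):
    """Parse, retain, arrange, and hash a document into one hex digest."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # parse, dropping formatting
    # Tokens without a score default to 1.0 (assumption of this sketch).
    retained = {t for t in tokens if scores.get(t, 1.0) >= threshold}
    h = hashlib.sha1()
    for token in sorted(retained):  # arranged token stream
        h.update(token.encode("utf-8") + b"\x00")
    return h.hexdigest()


class Node:
    """One bucket of the tree: a hash value plus the documents that share it."""

    def __init__(self, key, doc_id):
        self.key = key
        self.bucket = [doc_id]
        self.left = None
        self.right = None


class DocumentTree:
    def __init__(self):
        self.root = None

    def insert(self, doc_id, key):
        """Insert a tuple; return identifiers already co-located in its bucket."""
        if self.root is None:
            self.root = Node(key, doc_id)
            return []
        node = self.root
        while True:
            if key == node.key:
                similar = list(node.bucket)
                node.bucket.append(doc_id)
                return similar
            branch = "left" if key < node.key else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(key, doc_id))
                return []
            node = child


scores = {"the": 0.1}  # hypothetical token scores
tree = DocumentTree()
a = tree.insert("d1", single_hash("The quick brown fox", scores, 0.5))
b = tree.insert("d2", single_hash("Brown fox, quick!", scores, 0.5))
c = tree.insert("d3", single_hash("A different text", scores, 0.5))
```

In this sketch `d2` lands in the same bucket as `d1`, because both documents reduce to the same arranged set of retained tokens, while `d3` gets a bucket of its own. A balanced tree, as in claim 22, would keep lookups logarithmic; the unbalanced version is used here only to keep the example short.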

31. (Original) A computer for performing the method of claim 30.

32. (Original) A computer-readable medium having software for performing the method 
of claim 30. 

33. (Currently Amended) An apparatus for detecting similar documents comprising: 
means for obtaining a document; 

means for parsing the document to remove formatting and to obtain a token stream, the 
token stream comprising a plurality of tokens; 

means for retaining only retained tokens in the token stream by using at least one token 
threshold; 

means for arranging the retained tokens to obtain an arranged token stream; 

means for processing in turn each retained token in the arranged token stream using a
hash algorithm to obtain a single hash value for the document; 

means for generating a document identifier for the document; 

means for forming a single tuple for the document, the tuple comprising the document 
identifier for the document and the hash value for the document; 

means for inserting the tuple for the document into a document storage tree, the document 
storage tree comprising a plurality of tuples, each tuple located at a bucket of the document 
storage tree, each tuple in the plurality of tuples representing one of a plurality of documents, 
each tuple in the plurality of tuples comprising a document identifier and a hash value; and 

means for determining if the tuple for the document is co-located with another tuple at a 
same bucket in the document storage tree, thereby detecting if the document is similar to another 
document represented by the another tuple in the document storage tree. 

34. (Currently Amended) A method for detecting similar documents comprising the steps of:

determining a single hash value for a document; 

accessing a document storage structure comprising a plurality of hash values, each hash 
value in the plurality of hash values representing one of a plurality of documents; and 

determining if the hash value for the document is equivalent to another hash value in the 
document storage structure, thereby detecting if the document is similar to another document 
represented by the another hash value in the document storage structure. 

35. (Original) A computer for performing the method of claim 34. 

36. (Original) A computer-readable medium having software for performing the method 
of claim 34. 



Applicants: FRIEDER et al.
Appl. No. 09/629,175 

37. (Currently Amended) An apparatus for detecting similar documents comprising: 
means for determining a single hash value for a document; 

means for accessing a document storage structure comprising a plurality of hash values, 
each hash value in the plurality of hash values representing one of a plurality of documents; and 

means for determining if the hash value for the document is equivalent to another hash 
value in the document storage structure, thereby detecting if the document is similar to another 
document represented by the another hash value in the document storage structure. 

38. (Currently Amended) A method for detecting similar documents comprising the step of:

comparing a document to a plurality of documents in a document collection using a hash 
algorithm to generate a single hash value for the document and collection statistics to detect if the 
document is similar to any of the documents in the document collection. 

39. (Original) A method as in claim 38, wherein the collection statistics pertain to the 
document collection. 

40. (Original) A computer for performing the method of claim 38. 

41. (Original) A computer-readable medium having software for performing the method
of claim 38. 

42. (Currently Amended) An apparatus for detecting similar documents comprising: 
means for comparing a document to a plurality of documents in a document collection
using a hash algorithm to generate a single hash value for the document and collection statistics
to detect if the document is similar to any of the documents in the document collection. 

43. (Previously Added) A method as in claim 1, wherein the step of filtering the
document comprises the step of performing semantic filtering on the document. 